




! pip3 install numpy
! pip3 install pandas
! pip3 install matplotlib
! pip3 install seaborn
! pip3 install scikit-learn
Defaulting to user installation because normal site-packages is not writeable
Requirement already satisfied: numpy (1.23.0), pandas (2.0.0), matplotlib (3.7.1), seaborn (0.12.2), scikit-learn (1.2.2) and their dependencies in /Users/petercatania/Library/Python/3.9/lib/python/site-packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn
from pandas import Series, DataFrame
from pylab import rcParams
from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.metrics import classification_report
import seaborn as sns
sns.set_style('white')
sns.set_context('notebook')
%matplotlib inline
df = pd.read_csv("Diamonds Prices2022.csv")
df.sample(10)
| | Unnamed: 0 | carat | cut | color | clarity | depth | table | price | x | y | z |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 24640 | 24641 | 2.13 | Premium | I | SI2 | 59.8 | 58.0 | 12980 | 8.43 | 8.37 | 5.02 |
| 43560 | 43561 | 0.51 | Good | D | VS2 | 64.2 | 58.0 | 1430 | 5.01 | 5.06 | 3.23 |
| 39955 | 39956 | 0.33 | Good | D | SI2 | 63.4 | 56.0 | 492 | 4.40 | 4.43 | 2.80 |
| 22798 | 22799 | 1.22 | Ideal | E | VS1 | 61.8 | 56.0 | 10823 | 6.84 | 6.88 | 4.24 |
| 13922 | 13923 | 1.01 | Fair | G | VS2 | 67.8 | 59.0 | 5666 | 6.07 | 6.02 | 4.10 |
| 25376 | 25377 | 0.31 | Ideal | E | VS2 | 61.7 | 55.0 | 642 | 4.38 | 4.41 | 2.71 |
| 24687 | 24688 | 1.50 | Ideal | G | VS2 | 62.2 | 58.0 | 13049 | 7.31 | 7.29 | 4.54 |
| 26090 | 26091 | 2.51 | Ideal | H | SI1 | 62.9 | 56.0 | 15324 | 8.65 | 8.61 | 5.43 |
| 23088 | 23089 | 1.57 | Premium | G | SI1 | 59.2 | 59.0 | 11113 | 7.68 | 7.60 | 4.52 |
| 13951 | 13952 | 1.29 | Ideal | J | VS1 | 62.0 | 57.0 | 5676 | 6.92 | 6.98 | 4.31 |
First, check whether there are any NaN values in the dataset.
# count the numbers of NaN values in each column
df.isnull().sum()
Unnamed: 0    0
carat         0
cut           0
color         0
clarity       0
depth         0
table         0
price         0
x             0
y             0
z             0
dtype: int64
There are no missing values. The Unnamed: 0 column is just a copy of the row index and has no relation to the other features, so we can drop it.
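As an aside, the extra column can also be avoided at load time: read_csv accepts index_col=0, so the unnamed index column is never created. A minimal sketch on an inline CSV snippet (the two rows are made up for illustration):

```python
import io

import pandas as pd

# A CSV whose first column is an unnamed row index, like the diamonds file
csv_text = ",carat,price\n0,0.23,326\n1,0.21,326\n"
# index_col=0 makes pandas use that first column as the index,
# so no "Unnamed: 0" column appears in the DataFrame
df_demo = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(list(df_demo.columns))
```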
# drop first column (unnamed)
df = df.drop(df.columns[0], axis=1)
df.head()
| | carat | cut | color | clarity | depth | table | price | x | y | z |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.23 | Ideal | E | SI2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 |
| 1 | 0.21 | Premium | E | SI1 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 |
| 2 | 0.23 | Good | E | VS1 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 |
| 3 | 0.29 | Premium | I | VS2 | 62.4 | 58.0 | 334 | 4.20 | 4.23 | 2.63 |
| 4 | 0.31 | Good | J | SI2 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 |
# check the data types of each column
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53943 entries, 0 to 53942
Data columns (total 10 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   carat    53943 non-null  float64
 1   cut      53943 non-null  object
 2   color    53943 non-null  object
 3   clarity  53943 non-null  object
 4   depth    53943 non-null  float64
 5   table    53943 non-null  float64
 6   price    53943 non-null  int64
 7   x        53943 non-null  float64
 8   y        53943 non-null  float64
 9   z        53943 non-null  float64
dtypes: float64(6), int64(1), object(3)
memory usage: 4.1+ MB
Before looking at the correlation between the features, we need to convert the categorical features into numerical ones.
We do this by mapping each category to an ordinal value, like this:
For cut a higher value means better quality, while for color and clarity a lower value means higher quality. Better quality should translate into a higher price, but by how much?
# convert categorical data to numerical data
cutMap = {'Ideal': 5, 'Premium': 4, 'Very Good': 3, 'Good': 2, 'Fair': 1}
colorMap = {'D': 1, 'E': 2, 'F': 3, 'G': 4, 'H': 5, 'I': 6, 'J': 7}
clarityMap = {'IF': 1, 'VVS1': 2, 'VVS2': 3, 'VS1': 4, 'VS2': 5, 'SI1': 6, 'SI2': 7, 'I1': 8}
df['cut'] = df['cut'].map(cutMap)
df['color'] = df['color'].map(colorMap)
df['clarity'] = df['clarity'].map(clarityMap)
df.head()
| | carat | cut | color | clarity | depth | table | price | x | y | z |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.23 | 5 | 2 | 7 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 |
| 1 | 0.21 | 4 | 2 | 6 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 |
| 2 | 0.23 | 2 | 2 | 4 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 |
| 3 | 0.29 | 4 | 6 | 5 | 62.4 | 58.0 | 334 | 4.20 | 4.23 | 2.63 |
| 4 | 0.31 | 2 | 7 | 7 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 |
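One caveat with Series.map worth knowing: any category missing from the mapping dict silently becomes NaN rather than raising an error, so a typo in the data would slip through. A small sketch, with 'Superb' as a hypothetical unmapped value:

```python
import pandas as pd

cutMap = {'Ideal': 5, 'Premium': 4, 'Very Good': 3, 'Good': 2, 'Fair': 1}
s = pd.Series(['Ideal', 'Fair', 'Superb'])  # 'Superb' is a made-up, unmapped value
mapped = s.map(cutMap)
print(mapped.tolist())  # the unmapped entry becomes NaN, not an error
```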
# heatmap that shows the correlation between the different features
sns.heatmap(df.corr())
<Axes: >
# two side-by-side scatter plots: price vs depth and price vs table
fig, (axis1,axis2) = plt.subplots(1,2,figsize=(15,5))
axis1.scatter(df['depth'], df['price'])
axis1.set_title('Depth vs Price')
axis1.set_xlabel('Depth')
axis1.set_ylabel('Price')
axis2.scatter(df['table'], df['price'])
axis2.set_title('Table vs Price')
axis2.set_xlabel('Table')
Text(0.5, 0, 'Table')
#drop depth and table columns
dfCleaned = df.drop(['depth', 'table'], axis=1)
dfCleaned.head()
| | carat | cut | color | clarity | price | x | y | z |
|---|---|---|---|---|---|---|---|---|
| 0 | 0.23 | 5 | 2 | 7 | 326 | 3.95 | 3.98 | 2.43 |
| 1 | 0.21 | 4 | 2 | 6 | 326 | 3.89 | 3.84 | 2.31 |
| 2 | 0.23 | 2 | 2 | 4 | 327 | 4.05 | 4.07 | 2.31 |
| 3 | 0.29 | 4 | 6 | 5 | 334 | 4.20 | 4.23 | 2.63 |
| 4 | 0.31 | 2 | 7 | 7 | 335 | 4.34 | 4.35 | 2.75 |
#custom legend for cut column
cutLegend = {1: 'Fair', 2: 'Good', 3: 'Very Good', 4: 'Premium', 5: 'Ideal'}
# plot carat vs price, with cut hue and custom cutLegend
sns.lmplot(x='carat', y='price', data=dfCleaned, hue='cut', fit_reg=False, legend=False)
plt.legend(cutLegend.values())
<matplotlib.legend.Legend at 0x882440c40>
# three scatter plots side by side: x, y and z against price
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(15, 5))
ax1.scatter(dfCleaned['x'], dfCleaned['price'])
ax1.set_title('x vs price')
ax1.set_xlabel('x')
ax1.set_ylabel('price')
ax2.scatter(dfCleaned['y'], dfCleaned['price'])
ax2.set_title('y vs price')
ax2.set_xlabel('y')
ax3.scatter(dfCleaned['z'], dfCleaned['price'])
ax3.set_title('z vs price')
ax3.set_xlabel('z')
Text(0.5, 0, 'z')
# convert the ordinal columns to one-hot (binary) indicator columns using get_dummies
dfAdjusted = pd.get_dummies(dfCleaned, columns=['cut', 'color', 'clarity'])
dfAdjusted.head()
| | carat | price | x | y | z | cut_1 | cut_2 | cut_3 | cut_4 | cut_5 | ... | color_6 | color_7 | clarity_1 | clarity_2 | clarity_3 | clarity_4 | clarity_5 | clarity_6 | clarity_7 | clarity_8 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.23 | 326 | 3.95 | 3.98 | 2.43 | False | False | False | False | True | ... | False | False | False | False | False | False | False | False | True | False |
| 1 | 0.21 | 326 | 3.89 | 3.84 | 2.31 | False | False | False | True | False | ... | False | False | False | False | False | False | False | True | False | False |
| 2 | 0.23 | 327 | 4.05 | 4.07 | 2.31 | False | True | False | False | False | ... | False | False | False | False | False | True | False | False | False | False |
| 3 | 0.29 | 334 | 4.20 | 4.23 | 2.63 | False | False | False | True | False | ... | True | False | False | False | False | False | True | False | False | False |
| 4 | 0.31 | 335 | 4.34 | 4.35 | 2.75 | False | True | False | False | False | ... | False | True | False | False | False | False | False | False | True | False |
5 rows × 25 columns
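To see what get_dummies does, here is a minimal sketch on a toy frame with a single cut column (values made up): each distinct value gets its own boolean indicator column, and exactly one indicator is set per row.

```python
import pandas as pd

demo = pd.DataFrame({'cut': [5, 4, 2]})  # toy ordinal column
dummies = pd.get_dummies(demo, columns=['cut'])
print(list(dummies.columns))        # one indicator column per distinct value
print(dummies.sum(axis=1).tolist())  # exactly one indicator set per row
```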
# split the data into training and testing sets
X = dfAdjusted.drop('price', axis=1)
y = dfAdjusted['price']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# standardize the data
scaler = preprocessing.StandardScaler().fit(X_train)
# transform the data
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
# create dataframe from the scaled data
X_train_scaled = pd.DataFrame(X_train_scaled, columns=X_train.columns)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=X_test.columns)
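Note that the scaler is fit on the training split only and then applied to both splits, so no information from the test set leaks into the preprocessing. A minimal sketch of what StandardScaler does on a toy column:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_demo = np.array([[1.0], [2.0], [3.0], [4.0]])   # toy "training" column
demo_scaler = StandardScaler().fit(X_demo)        # learns mean and std from this data only
X_demo_scaled = demo_scaler.transform(X_demo)
print(X_demo_scaled.mean(), X_demo_scaled.std())  # mean ~0, std ~1 after scaling
```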
# Polynomial Regression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
# a high degree (e.g. 8) on 24 features generates a combinatorial number of terms;
# degree 2 is already expressive here
poly_reg = PolynomialFeatures(degree=2)
X_poly_train = poly_reg.fit_transform(X_train_scaled)
X_poly_test = poly_reg.transform(X_test_scaled)
lin_reg = LinearRegression().fit(X_poly_train, y_train)
# Predicting on the test set with Polynomial Regression
y_pred = lin_reg.predict(X_poly_test)
# print the R^2 score (classification_report applies to classifiers, not regression)
print(lin_reg.score(X_poly_test, y_test))
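The number of polynomial terms grows combinatorially: for n features and degree d, PolynomialFeatures generates C(n + d, d) columns (including the bias and interaction terms). A small sketch with the 24 predictor columns of dfAdjusted, showing why a high degree such as 8 is impractical here:

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

n_features = 24  # dfAdjusted has 24 predictor columns after dropping price
for degree in (2, 3):
    n_terms = PolynomialFeatures(degree=degree).fit_transform(
        np.zeros((1, n_features))).shape[1]
    print(degree, n_terms)            # matches C(n_features + degree, degree)
print(8, comb(n_features + 8, 8))     # degree 8 would generate over 10 million terms
```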
# Logistic Regression
# NOTE: price is a continuous target, so LogisticRegression (a classifier) treats every
# distinct price as a separate class; it is not an appropriate model for this problem
# and is kept only for comparison (max_iter is left low to bound the runtime)
logreg = LogisticRegression(solver='sag', max_iter=1)
logreg.fit(X_train_scaled, y_train)
# Predicting a new result with Logistic Regression
y_pred = logreg.predict(X_test_scaled)
#print accuracy of the model
print(classification_report(y_test, y_pred))
# Ridge regression
from sklearn.linear_model import Ridge
# SAG is a stochastic gradient method that is particularly useful for large-scale linear regression problems.
# SAGA is a variant of SAG that also supports the non-smooth L1 penalty.
#ridgeReg = Ridge(alpha=0.05, solver='sag')
ridgeReg = Ridge(alpha=0.05, solver='saga')
ridgeReg.fit(X_train_scaled, y_train)
# Predicting a new result with Ridge Regression
y_pred = ridgeReg.predict(X_test_scaled)
#print accuracy of the model
print(ridgeReg.score(X_test_scaled, y_test))
0.92242537668462
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Ridge
# parameters that we want to tune
alpha = [1e-15, 1e-10, 1e-8, 1e-4, 1e-3,1e-2, 1, 5, 10, 20, 30, 50, 70, 100, 150, 200, 300, 500, 700, 1000, 1500, 2000]
ridge = Ridge()
parameters = {'alpha': alpha}
# GridSearchCV will try all the combinations of the parameters
ridge_regressor = GridSearchCV(ridge, parameters,scoring='neg_mean_squared_error', cv=5)
ridge_regressor.fit(X_train_scaled, y_train)
print(ridge_regressor.best_params_)
print(-ridge_regressor.best_score_)
{'alpha': 20}
1295574.3904235393
ridge = Ridge(alpha=ridge_regressor.best_params_['alpha'])
ridge.fit(X_train_scaled, y_train)
# Predicting a new result with Ridge Regression
y_pred = ridge.predict(X_test_scaled)
#print accuracy of the model
print(ridge.score(X_test_scaled, y_test))
0.922375573729045
#Lasso regression
from sklearn.linear_model import Lasso
lassoReg = Lasso(alpha=0.6)
lassoReg.fit(X_train_scaled, y_train)
# Predicting a new result with Lasso Regression
y_pred = lassoReg.predict(X_test_scaled)
#print accuracy of the model
print(lassoReg.score(X_test_scaled, y_test))
0.9223987609445108
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Lasso
# modify train and test data, with only x,y,z
X_train = X_train_scaled[['x', 'y', 'z']]
X_test = X_test_scaled[['x', 'y', 'z']]
# parameters that we want to tune
alpha = [1e-15, 1e-10, 1e-8, 1e-4, 1e-3,1e-2, 0.6, 1, 5, 10, 20, 30, 50, 70, 100, 150, 200, 300, 500, 700, 1000, 1500, 2000]
lassoReg = Lasso(alpha=0.6)
lassoReg.fit(X_train, y_train)
# Predicting a new result with Lasso Regression
y_pred = lassoReg.predict(X_test)
#print accuracy of the model
print(lassoReg.score(X_test, y_test))
0.7804924706294081
lasso = Lasso()
parameters = {'alpha': alpha}
# GridSearchCV will try all the combinations of the parameters
lasso_regressor = GridSearchCV(lasso, parameters,scoring='neg_mean_squared_error', cv=5)
lasso_regressor.fit(X_train, y_train)
print(lasso_regressor.best_params_)
print(-lasso_regressor.best_score_)
lasso = Lasso(alpha=lasso_regressor.best_params_['alpha'])
lasso.fit(X_train, y_train)
# Predicting a new result with Lasso Regression
y_pred = lasso.predict(X_test)
#print accuracy of the model
print(lasso.score(X_test, y_test))
(repeated ConvergenceWarning from sklearn's coordinate descent, raised for the smallest alpha values in the grid: "Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation.")
{'alpha': 50}
3462425.630456732
0.7803486557847755
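The distinctive property of Lasso, compared with Ridge, is that the L1 penalty drives uninformative coefficients to exactly zero, effectively performing feature selection. A minimal sketch on synthetic data (all names and values made up):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = 3.0 * X_demo[:, 0]            # only the first feature carries signal
lasso_demo = Lasso(alpha=0.5).fit(X_demo, y_demo)
print(lasso_demo.coef_)                # the four uninformative coefficients are exactly 0
```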
#importing the SGD Regressor
from sklearn.linear_model import SGDRegressor
# Fitting SGD Regressor to the Training set
sgd_reg = SGDRegressor()
sgd_reg.fit(X_train_scaled, y_train)
# Predicting a new result with SGD Regressor
y_pred = sgd_reg.predict(X_test_scaled)
# accuracy (R^2 score) of the SGD Regressor
print(sgd_reg.score(X_test_scaled, y_test))
0.9222990712101318
# modify train and test data, keeping only x
X_train = X_train_scaled[['x']]
X_test = X_test_scaled[['x']]
# Fitting SGD Regressor to the Training set
sgd_reg = SGDRegressor()
sgd_reg.fit(X_train, y_train)
# Predicting a new result with SGD Regressor
y_pred = sgd_reg.predict(X_test)
# accuracy (R^2 score) of the SGD Regressor
print(sgd_reg.score(X_test, y_test))
0.780279120075163
# modify train and test data, keeping only carat
X_train = X_train_scaled[['carat']]
X_test = X_test_scaled[['carat']]
# Fitting SGD Regressor to the Training set
sgd_reg = SGDRegressor()
sgd_reg.fit(X_train, y_train)
# Predicting a new result with SGD Regressor
y_pred = sgd_reg.predict(X_test)
# accuracy (R^2 score) of the SGD Regressor
print(sgd_reg.score(X_test, y_test))
0.7805163504962933
We can see that the accuracy obtained by training on x alone or on carat alone is fair, but obviously neither is the best model.
# Decision Tree
# A decision tree is a non-parametric supervised learning method that can be used for both classification and regression.
from sklearn.tree import DecisionTreeRegressor
dtree = DecisionTreeRegressor()
dtree.fit(X_train_scaled, y_train)
# Predicting a new result with Decision Tree
y_pred = dtree.predict(X_test_scaled)
#print accuracy of the model
print(dtree.score(X_test_scaled, y_test))
0.9649467463950754
# Random Forest Regression
from sklearn.ensemble import RandomForestRegressor
rfr = RandomForestRegressor(n_estimators=100, random_state=0)
rfr.fit(X_train_scaled, y_train)
# Predicting a new result with Random Forest Regression
y_pred = rfr.predict(X_test_scaled)
#print accuracy of the model
print(rfr.score(X_test_scaled, y_test))
0.9784884722052908
# Support Vector Machine model
from sklearn.svm import SVR
svr_model = SVR()
svr_model.fit(X_train_scaled, y_train)
# Predicting a new result with Support Vector Machine
y_pred = svr_model.predict(X_test_scaled)
#print accuracy of the model
print(svr_model.score(X_test_scaled, y_test))
0.36438475571860673
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# Splitting the dataset into the Training set and Test set
X = dfCleaned.drop('color', axis=1)
y = dfCleaned['color']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# Feature Scaling
sc = StandardScaler()
# transform the data
X_train_scaled = sc.fit_transform(X_train)
X_test_scaled = sc.transform(X_test)
# create dataframe from the scaled data
X_train_scaled = pd.DataFrame(X_train_scaled, columns=X_train.columns)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=X_test.columns)
# KNN model for classification of the color
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_scaled, y_train)
# Predicting a new result with KNN
y_pred = knn.predict(X_test_scaled)
#print accuracy of the model
print('KNN Classification Accuracy: ',knn.score(X_test_scaled, y_test))
KNN Classification Accuracy:  0.4326628973954954
error = []
# KNeighborsClassifier works best with a small number of features (roughly up to 4 or 5)
# Calculating the error for K values between 1 and 60
for i in range(1, 60):
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train_scaled, y_train)  # use the scaled data, consistent with the model above
    pred_i = knn.predict(X_test_scaled)
    error.append(np.mean(pred_i != y_test))  # mean error rate for this K
plt.figure(figsize=(12, 6))
plt.plot(range(1, 60), error, color='red', linestyle='dashed', marker='o', markerfacecolor='blue', markersize=10)
plt.title('Error Rate vs. K Value')
plt.xlabel('K Value')
plt.ylabel('Mean Error')
Text(0, 0.5, 'Mean Error')
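Once the error list is filled, the best K is simply the position of the smallest mean error, offset by one because K starts at 1. A sketch with hypothetical error values:

```python
import numpy as np

# hypothetical mean-error values for K = 1..5, standing in for the `error` list above
error_demo = [0.56, 0.53, 0.55, 0.57, 0.58]
best_k = int(np.argmin(error_demo)) + 1  # +1 because K starts at 1, not 0
print(best_k)
```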
# KNN model for classification of the color
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train_scaled, y_train)
# Predicting a new result with KNN
y_pred = knn.predict(X_test_scaled)
#print accuracy of the model
print('KNN Classification Accuracy: ',knn.score(X_test_scaled, y_test))
KNN Classification Accuracy:  0.44758550375382333
# SVM model for classification of the color
from sklearn.svm import SVC
svc_model = SVC()
svc_model.fit(X_train_scaled, y_train)
# Predicting a new result with SVM
y_pred = svc_model.predict(X_test_scaled)
#print accuracy of the model
print('SVM Classification Accuracy: ',svc_model.score(X_test_scaled, y_test))
SVM Classification Accuracy:  0.3821484845676152
# SVM with manually chosen hyperparameters
from sklearn.svm import SVC
svc_model = SVC(C=1, gamma=0.1)
svc_model.fit(X_train_scaled, y_train)
# Predicting a new result with SVM
y_pred = svc_model.predict(X_test_scaled)
#print accuracy of the model
print('SVM Classification Accuracy: ',svc_model.score(X_test_scaled, y_test))
SVM Classification Accuracy:  0.3745481508944295
# Decision Tree
# A decision tree is a non-parametric supervised learning method that can be used for both classification and regression.
from sklearn.tree import DecisionTreeClassifier
dtree = DecisionTreeClassifier()
dtree.fit(X_train_scaled, y_train)
# Predicting a new result with Decision Tree
y_pred = dtree.predict(X_test_scaled)
#print accuracy of the model
print('Decision Tree Classification Accuracy: ',dtree.score(X_test_scaled, y_test))
Decision Tree Classification Accuracy:  0.5130225229400315
# Random Forest
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(n_estimators=100)
rfc.fit(X_train_scaled, y_train)
# Predicting a new result with Random Forest
y_pred = rfc.predict(X_test_scaled)
#print accuracy of the model
print('Random Forest Classification Accuracy: ',rfc.score(X_test_scaled, y_test))
Random Forest Classification Accuracy:  0.17582723143942905
# importance of the features in the Random Forest model
feature_imp = pd.Series(rfc.feature_importances_, index=X.columns).sort_values(ascending=False)
# Creating a bar plot
sns.barplot(x=feature_imp, y=feature_imp.index)
<Axes: >
X = X_train_scaled.drop(['carat','clarity','cut'], axis=1)
X_test = X_test_scaled.drop(['carat','clarity','cut'], axis=1)
rfc = RandomForestClassifier(n_estimators=100)
rfc.fit(X, y_train)
# Predicting a new result with Random Forest
y_pred = rfc.predict(X_test)
#print accuracy of the model
print('Random Forest Classification Accuracy: ',rfc.score(X_test, y_test))
Random Forest Classification Accuracy:  0.1738808045231254
# add the approximate volume of the diamond, modelled as an ellipsoid: V = (4/3) x π x (r1 x r2 x r3)
# (x, y and z are used directly as the radii, which only scales every volume by a constant factor)
dfCleaned['volume'] = (4/3) * np.pi * (dfCleaned['x'] * dfCleaned['y'] * dfCleaned['z'])
# add density of the diamond
dfCleaned['density'] = dfCleaned['carat'] / dfCleaned['volume']
# add price per carat
dfCleaned['price_per_carat'] = dfCleaned['price'] / dfCleaned['carat']
# add price per volume
dfCleaned['price_per_volume'] = dfCleaned['price'] / dfCleaned['volume']
# Splitting the dataset into the Training set and Test set
X = dfCleaned.drop('color', axis=1)
y = dfCleaned['color']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# Feature Scaling
sc = StandardScaler()
# Drop the new columns before scaling
X_train = X_train.drop(['density', 'price_per_volume'], axis=1)
X_test = X_test.drop(['density', 'price_per_volume'], axis=1)
# transform the data
X_train_scaled = sc.fit_transform(X_train)
X_test_scaled = sc.transform(X_test)
# create dataframe from the scaled data
X_train_scaled = pd.DataFrame(X_train_scaled, columns=X_train.columns)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=X_test.columns)
rfc = RandomForestClassifier(n_estimators=100)
# random forests are insensitive to feature scaling, so fitting on the unscaled data is fine
rfc.fit(X_train, y_train)
# Predicting a new result with Random Forest
y_pred = rfc.predict(X_test)
#print accuracy of the model
print('Random Forest Classification Accuracy: ',rfc.score(X_test, y_test))
# importance of the features in the Random Forest model
feature_imp = pd.Series(rfc.feature_importances_, index=X_train.columns).sort_values(ascending=False)
# Creating a bar plot
sns.barplot(x=feature_imp, y=feature_imp.index)
Random Forest Classification Accuracy:  0.6272129020298453
<Axes: >
sns.pairplot(dfCleaned, hue='color', palette='coolwarm')
<seaborn.axisgrid.PairGrid at 0x17df6ec40>
sns.pairplot(dfCleaned, hue='clarity', palette='coolwarm')
<seaborn.axisgrid.PairGrid at 0x17df5a970>
sns.pairplot(dfCleaned, hue='cut', palette='coolwarm')
<seaborn.axisgrid.PairGrid at 0x29f2cfa60>
# Clustering with K-Means
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=3, n_init=10)  # set n_init explicitly (its default changes in sklearn 1.4)
kmeans.fit(dfCleaned.drop('clarity', axis=1))
# view the cluster centroids
print(kmeans.cluster_centers_)
# view the labels
print(kmeans.labels_)
# add a new column to the dataframe with the cluster labels
dfCleaned['cluster'] = kmeans.labels_
# plot the clusters
sns.lmplot(x='x', y='carat', data=dfCleaned, hue='cluster', palette='coolwarm', aspect=1, fit_reg=False)
[[1.11558167e+00 3.72150251e+00 3.91133390e+00 5.77562945e+03 6.61673133e+00 6.61299655e+00 4.08711087e+00]
 [1.71679512e+00 3.89623475e+00 4.15290790e+00 1.33434771e+04 7.63437511e+00 7.63600495e+00 4.69954216e+00]
 [4.91864237e-01 3.99061503e+00 3.35012908e+00 1.45603620e+03 4.99099742e+00 4.99797844e+00 3.08344024e+00]]
[2 2 2 ... 2 2 2]
<seaborn.axisgrid.FacetGrid at 0x8807c1a30>
kmeans = KMeans(n_clusters=8, n_init=10)
kmeans.fit(dfCleaned.drop('clarity', axis=1))
# add a new column to the dataframe with the cluster labels
dfCleaned['cluster'] = kmeans.labels_
# side by side plots of the clusters and the original data
fig, (ax1, ax2) = plt.subplots(1, 2, sharey=True, figsize=(15,6))
ax1.set_title('KMeans')
sns.scatterplot(x='x', y='carat', data=dfCleaned, hue='cluster', palette='coolwarm', ax=ax1, legend=True)
ax2.set_title("Original")
sns.scatterplot(x='x', y='carat', data=dfCleaned, hue='clarity', palette='coolwarm', ax=ax2, legend=True)
<matplotlib.legend.Legend at 0x87ad14910>
# Clustering models: K-Means, Hierarchical Clustering, DBSCAN, Gaussian Mixture Model (GMM), Mean Shift, Spectral Clustering, Affinity Propagation,
# Agglomerative Clustering, Birch, Mini-Batch K-Means, OPTICS, and more.
# Clustering with Hierarchical (Agglomerative) Clustering
from sklearn.cluster import AgglomerativeClustering
agg = AgglomerativeClustering(n_clusters=8)
agg.fit(dfCleaned.drop('clarity', axis=1))
# view the labels
print(agg.labels_)
# add a new column to the dataframe with the cluster labels
dfCleaned['cluster'] = agg.labels_
[5 5 5 ... 0 0 0]
# side by side plots of the clusters and the original data
fig, (ax1, ax2) = plt.subplots(1, 2, sharey=True, figsize=(15,6))
ax1.set_title('AgglomerativeClustering')
sns.scatterplot(x='price', y='carat', data=dfCleaned, hue='cluster', palette='coolwarm', ax=ax1)
ax2.set_title("Original")
sns.scatterplot(x='price', y='carat', data=dfCleaned, hue='clarity', palette='coolwarm', ax=ax2)
<Axes: title={'center': 'Original'}, xlabel='price', ylabel='carat'>
# Clustering with Spectral Clustering
from sklearn.cluster import SpectralClustering
mbk = SpectralClustering(n_clusters=8)
mbk.fit(dfCleaned.drop('clarity', axis=1))
# view the labels
print(mbk.labels_)
# add a new column to the dataframe with the cluster labels
dfCleaned['cluster'] = mbk.labels_
/Users/petercatania/Library/Python/3.9/lib/python/site-packages/sklearn/manifold/_spectral_embedding.py:274: UserWarning: Graph is not fully connected, spectral embedding may not work as expected. warnings.warn(
--------------------------------------------------------------------------- KeyboardInterrupt Traceback (most recent call last) Cell In[39], line 5 2 from sklearn.cluster import SpectralClustering 4 mbk = SpectralClustering(n_clusters=8) ----> 5 mbk.fit(dfCleaned.drop('clarity', axis=1)) 7 # view the labels 8 print(mbk.labels_) File ~/Library/Python/3.9/lib/python/site-packages/sklearn/cluster/_spectral.py:750, in SpectralClustering.fit(self, X, y) 745 self.affinity_matrix_ = pairwise_kernels( 746 X, metric=self.affinity, filter_params=True, **params 747 ) 749 random_state = check_random_state(self.random_state) --> 750 self.labels_ = spectral_clustering( 751 self.affinity_matrix_, 752 n_clusters=self.n_clusters, 753 n_components=self.n_components, 754 eigen_solver=self.eigen_solver, 755 random_state=random_state, 756 n_init=self.n_init, 757 eigen_tol=self.eigen_tol, 758 assign_labels=self.assign_labels, 759 verbose=self.verbose, 760 ) 761 return self File ~/Library/Python/3.9/lib/python/site-packages/sklearn/cluster/_spectral.py:371, in spectral_clustering(affinity, n_clusters, n_components, eigen_solver, random_state, n_init, eigen_tol, assign_labels, verbose) 363 n_components = n_clusters if n_components is None else n_components 365 # We now obtain the real valued solution matrix to the 366 # relaxed Ncut problem, solving the eigenvalue problem 367 # L_sym x = lambda x and recovering u = D^-1/2 x. 368 # The first eigenvector is constant only for fully connected graphs 369 # and should be kept for spectral clustering (drop_first = False) 370 # See spectral_embedding documentation. 
--> 371 maps = spectral_embedding( 372 affinity, 373 n_components=n_components, 374 eigen_solver=eigen_solver, 375 random_state=random_state, 376 eigen_tol=eigen_tol, 377 drop_first=False, 378 ) 379 if verbose: 380 print(f"Computing label assignment using {assign_labels}") File ~/Library/Python/3.9/lib/python/site-packages/sklearn/manifold/_spectral_embedding.py:314, in spectral_embedding(adjacency, n_components, eigen_solver, random_state, eigen_tol, norm_laplacian, drop_first) 312 laplacian *= -1 313 v0 = _init_arpack_v0(laplacian.shape[0], random_state) --> 314 _, diffusion_map = eigsh( 315 laplacian, k=n_components, sigma=1.0, which="LM", tol=tol, v0=v0 316 ) 317 embedding = diffusion_map.T[n_components::-1] 318 if norm_laplacian: 319 # recover u = D^-1/2 x from the eigenvector output x File ~/Library/Python/3.9/lib/python/site-packages/scipy/sparse/linalg/_eigen/arpack/arpack.py:1697, in eigsh(A, k, M, sigma, which, v0, ncv, maxiter, tol, return_eigenvectors, Minv, OPinv, mode) 1695 with _ARPACK_LOCK: 1696 while not params.converged: -> 1697 params.iterate() 1699 return params.extract(return_eigenvectors) File ~/Library/Python/3.9/lib/python/site-packages/scipy/sparse/linalg/_eigen/arpack/arpack.py:560, in _SymmetricArpackParams.iterate(self) 558 else: 559 Bxslice = slice(self.ipntr[2] - 1, self.ipntr[2] - 1 + self.n) --> 560 self.workd[yslice] = self.OPa(self.workd[Bxslice]) 561 elif self.ido == 2: 562 self.workd[yslice] = self.B(self.workd[xslice]) File ~/Library/Python/3.9/lib/python/site-packages/scipy/sparse/linalg/_interface.py:232, in LinearOperator.matvec(self, x) 229 if x.shape != (N,) and x.shape != (N,1): 230 raise ValueError('dimension mismatch') --> 232 y = self._matvec(x) 234 if isinstance(x, np.matrix): 235 y = asmatrix(y) File ~/Library/Python/3.9/lib/python/site-packages/scipy/sparse/linalg/_eigen/arpack/arpack.py:944, in LuInv._matvec(self, x) 943 def _matvec(self, x): --> 944 return lu_solve(self.M_lu, x) File 
~/Library/Python/3.9/lib/python/site-packages/scipy/linalg/_decomp_lu.py:147, in lu_solve(lu_and_piv, b, trans, overwrite_b, check_finite) 143 raise ValueError("Shapes of lu {} and b {} are incompatible" 144 .format(lu.shape, b1.shape)) 146 getrs, = get_lapack_funcs(('getrs',), (lu, b1)) --> 147 x, info = getrs(lu, piv, b1, trans=trans, overwrite_b=overwrite_b) 148 if info == 0: 149 return x KeyboardInterrupt:
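Because spectral clustering's affinity matrix grows quadratically with the number of rows, a common workaround is to fit on a random subsample. A sketch on synthetic data (synthetic blobs stand in for the diamonds rows; `assign_labels="discretize"` is one of sklearn's supported label-assignment strategies):

```python
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=5000, centers=3, random_state=0)

# fit on a random subsample to keep the affinity matrix small
rng = np.random.default_rng(0)
idx = rng.choice(len(X), size=500, replace=False)

spec = SpectralClustering(n_clusters=3, assign_labels="discretize", random_state=0)
sub_labels = spec.fit_predict(X[idx])
print(sub_labels.shape)  # (500,)
```

Labels for the remaining rows can then be assigned by nearest subsampled neighbor, or a Nyström-style approximation can be used instead.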
# side by side plots of the clusters and the original data
fig, (ax1, ax2) = plt.subplots(1, 2, sharey=True, figsize=(15,6))
ax1.set_title('Spectral Clustering')
sns.scatterplot(x='price', y='carat', data=dfCleaned, hue='cluster', palette='coolwarm', ax=ax1)
ax2.set_title("Original")
sns.scatterplot(x='price', y='carat', data=dfCleaned, hue='clarity', palette='coolwarm', ax=ax2)
import umap
# create a UMAP model with 2 dimensions
# n_neighbors default is 15
umapModel = umap.UMAP(n_components=2, n_neighbors=5, random_state=42, min_dist=0.1)
# fit the model to the data
manifold = umapModel.fit(dfCleaned.drop('clarity', axis=1))
#! pip install pandas matplotlib datashader bokeh holoviews colorcet scikit-image
#! pip install umap-learn[plot]
import umap.plot
y = dfCleaned['clarity'].values.flatten()
# plot the UMAP model with the colors generated by the UMAP model
umap.plot.points(manifold, labels=y, theme="fire", width=1500, height=1000);
# Image size of 636x251086 pixels is too large. It must be less than 2^16 in each direction.
# plot different UMAP models with different parameters
fig, ax_array = plt.subplots(4, 4, figsize=(15, 15))
for a, n in enumerate([5, 10, 15, 20]):
    for b, d in enumerate([0.1, 0.25, 0.5, 0.75]):
        umapModel = umap.UMAP(n_components=2, n_neighbors=n, random_state=42, min_dist=d)
        manifold = umapModel.fit(dfCleaned.drop('clarity', axis=1))
        # drop width/height when passing an existing axes, otherwise the
        # combined figure can exceed matplotlib's 2^16-pixel limit
        umap.plot.points(manifold, labels=y, theme="fire", ax=ax_array[a, b])
        ax_array[a, b].set_title(f"n_neighbors={n}, min_dist={d}")
# plot the UMAP model with the colors generated by the UMAP model
umap.plot.points(manifold, labels=y, theme="fire", width=1500, height=1000);